Random Forest, Logistic Regression, Extreme Gradient Boosting (XGBoost), and
AdaBoost classifiers are trained on the Breast Cancer
Wisconsin Diagnostic dataset, and their efficacy is assessed to
identify the most effective ensemble and machine learning
classifiers for breast cancer detection and diagnosis in terms
of accuracy and AUC score.
1. INTRODUCTION


Breast cancer is defined as any kind of malignant tumor that arises in the breast, as shown in Figure 1. It
affects around 10% of all women at some point in their lives, making it the most common kind of cancer in
women. Breast cancer is the second leading cause of death among women, after lung cancer.
In the US, invasive breast cancer was projected to affect 246,660 women in 2016 and to claim the lives of
40,450 women. Early detection and prompt treatment are the only ways to stop the tumor from spreading.
Cancer mortality has similarly grown, from 6.2 million deaths in 2000 to 10 million in 2020 [1]. One in
every six deaths is caused by cancer. This demonstrates how crucial it is to fund both cancer treatment and
cancer prevention. Information and communication technology (ICT) must be employed effectively in
medical practice in order to modernize the healthcare system, and cancer therapy in particular. Big data has
changed both the amount of data available and the amount of value that can be derived from it. It has
greatly changed business intelligence (BI) by making it possible to analyze massive amounts of
unstructured, varied, and non-standard data, including healthcare data. Since it not only predicts but also
assists in decision-making, it is usually seen as a breakthrough in continuing innovation, with the objective
of enhancing patient care quality and reducing healthcare costs [2].


Figure 1. The breast cancer diagnosis process [3].


Many machine learning approaches may be used to detect and diagnose breast cancer, including Random
Forest, Logistic Regression, Extreme Gradient Boosting (XGBoost), and the AdaBoost classifier.
Researchers have used a variety of datasets in their work on breast cancer, including the SEER dataset,
mammography image databases, the Wisconsin dataset, and datasets from other institutions, extracting and
selecting distinguishing features from them. One study uses 3D images to demonstrate the classification of
breast cancer with several supervised machine learning algorithms, concluding that the support vector
machine (SVM) is the best choice overall [4]. A comparative study of the relevance vector machine (RVM),
on the other hand, found that RVM is superior to other machine learning algorithms for identifying breast
cancer even when the variables are limited, reaching 97 percent accuracy [5]. Compared with other machine
learning approaches used for breast cancer diagnosis, RVM also has a low processing cost. SVM has been
reported to predict and detect breast cancer with the highest accuracy and lowest error rate [6]. Our
research focuses on evaluating machine learning approaches and algorithms to determine the best strategy
for breast cancer detection and prediction.


The remainder of this paper is structured as follows. Section 2 describes the techniques and findings of
prior research on breast cancer diagnosis. Section 3 describes the proposed methodology. Section 4
presents and discusses the outcomes of the experiments. Section 5 concludes the paper.


2. LITERATURE REVIEW 


Several machine learning techniques are available for identifying and predicting breast cancer, among them
Random Forest, Logistic Regression, Extreme Gradient Boosting (XGBoost), and the AdaBoost classifier.
A number of datasets, including the SEER dataset, mammography image databases, the Wisconsin dataset,
and datasets from other institutions, have been used by researchers to study breast cancer. In these studies,
authors extract and select distinguishing features from the databases. One author employs 3D images to
demonstrate the use of several supervised machine learning algorithms in the classification of breast
cancer and concludes that the support vector machine (SVM) is the best choice based on its overall
performance [4]. Research comparing the relevance vector machine (RVM) with other machine learning
approaches for breast cancer diagnosis, on the other hand, finds that RVM has a low processing cost and
that it outperforms other machine learning algorithms for identifying breast cancer, even with fewer
variables, reaching 97 percent accuracy [5]. SVM has also demonstrated its value in breast cancer diagnosis
and prediction, with the highest accuracy and lowest error rate [6]. Our study examines multiple machine
learning methodologies and algorithms in order to determine the best choice for early breast cancer
detection and diagnosis.


Data analytics can also be a powerful strategy for datasets associated with breast cancer. One such
technique takes into account both cancer survivability and patient survival. For identification, the
Surveillance, Epidemiology, and End Results (SEER) data were used, and for classification, the
Self-Organizing Map (SOM) and DBSCAN were used. Table 1 lists the objectives and methods of
recent, relevant studies.


Table 1. Recent related studies: comparative summary of past research work.


Objective                        Reference   Methods and approaches adopted
Clarification of data science    [7]         Related work
and applications                 [8]         Naive Bayes, decision tree, support vector machine
                                 [9]         Neural networks, SVM, decision tree, Naive Bayes
                                 [10]        k-nearest neighbors, SVM, Naive Bayes, decision tree (C4.5)
                                 [11]        Neural network, C4.5 decision tree, Naive Bayes
                                 [12]        Neural networks, logistic regression, nearest neighbors, decision tree
Prediction of breast cancer      [13]        Naive Bayes classifier, support vector machine (SVM), AdaBoost tree,
                                             artificial neural network (ANN)
                                 [14]        J48 decision trees and Naive Bayes
                                 [15]        Multilayer perceptron, Naive Bayes, and support vector machine with
                                             sequential minimal optimization


3. METHODOLOGY 


Since our main objective in this research was to find the most effective and accurate approach for
detecting breast cancer, we employed machine learning and ensemble classifiers. Random Forest, Logistic
Regression, Extreme Gradient Boosting (XGBoost), and AdaBoost classifiers are trained on the Breast
Cancer Wisconsin Diagnostic dataset, and their results are compared to establish which model is most
accurate.


Figure 2. The proposed methodology, describing the overall process from the dataset to the final result.


As shown in Figure 2, data collection is the first step in our strategy. Pre-processing is the second step and
is broken down into four stages: data cleaning, attribute selection, target role creation, and feature
extraction. A machine learning model is then built on the processed data to identify breast cancer from a
new set of measurements. To assess the performance of the algorithms, we give the model data it has not
seen, for which labels are available. This is commonly accomplished by using the train/test split approach
to divide the labelled data we have gathered into two parts, in a 70:30 ratio, for training and testing,
respectively.
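The 70:30 split described above can be sketched as follows. This is a minimal illustration rather than the
exact procedure used in the study: it assumes the copy of the Wisconsin Diagnostic data bundled with
scikit-learn (load_breast_cancer), and the random_state value is an arbitrary choice made here for
reproducibility.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled copy of the Wisconsin Diagnostic data.
X, y = load_breast_cancer(return_X_y=True)

# Hold out 30% of the labelled samples for testing; train on the remaining 70%.
# random_state=42 is an arbitrary seed, not a value taken from the paper.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

print(X_train.shape, X_test.shape)  # (398, 30) (171, 30)
```

With 569 samples, a 30% hold-out gives 171 test cases, so each test case contributes about 0.585
percentage points of accuracy.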


3.1. Algorithms 


In our study, we implement ensemble classifiers and machine learning predictive analysis. The algorithms
employed in our project are as follows:


> Random Forest: A random forest trains a large number of decision trees for classification or regression.
For classification it outputs the class selected by most of the trees, and for regression it outputs the mean
of the individual tree predictions. Random decision forests are a solution to the tendency of decision
trees to overfit their training set.


> Logistic regression, a generalized linear model, is a particularly successful modelling approach [16].
It is used to estimate the probability of an illness or other health problem based on one or more risk
factors. Simple and multivariate logistic regression examine the relationship between one or more
independent variables (Xi), also known as exposure or predictor variables, and a binary dependent
variable (Y), also known as the outcome or response variable. It is most often used to predict binary
dependent variables and can be extended to multiclass ones.


> Extreme Gradient Boosting classifier: Gradient boosting is an ensemble machine learning technique that
may be applied to classification and regression predictive modelling problems. "Extreme gradient
boosting," or "XGBoost," is a fast open-source implementation of the gradient boosting technique. The
name XGBoost therefore refers at once to an open-source project, a Python library, and an algorithm. It is
designed for very high computational efficiency, often exceeding that of other open-source
implementations. The name "XGBoost" refers to the engineering goal of pushing the limit of
computational resources for boosted tree algorithms.


> The AdaBoost classifier is a meta-estimator that first fits a classifier on the original dataset and then fits
additional copies of the classifier on the same dataset, with the weights of incorrectly classified
instances adjusted [17]. This allows subsequent classifiers to focus on challenging cases.
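The four classifiers listed above can be instantiated and fitted along the following lines. This is a hedged
sketch, not the configuration used in the study: the hyperparameters, the random seeds, the stratified
split, and the feature scaling applied before logistic regression are all assumptions made here, and
XGBoost is imported only if the separate xgboost package happens to be installed.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
# 70:30 split; stratify and random_state are illustrative assumptions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0, stratify=y
)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    # Scaling is added here so the solver converges; the paper does not mention it.
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
}

# XGBoost is distributed separately from scikit-learn, so it is optional here.
try:
    from xgboost import XGBClassifier
    models["XGBoost"] = XGBClassifier(n_estimators=100, random_state=0)
except ImportError:
    pass

# Fit every model on the training part and score it on the held-out part.
test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = model.score(X_test, y_test)

for name, acc in sorted(test_accuracy.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {100 * acc:.2f}%")
```

Because the split and hyperparameters are assumptions, the printed accuracies will not necessarily match
the figures reported in Section 4.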


3.2. Dataset acquisition 


We used the Breast Cancer Wisconsin Diagnostic data from the University of Wisconsin Hospitals Madison
Breast Cancer Network to carry out our study [14]. The features of the dataset were computed from a
digitized image of a breast mass sample obtained by fine-needle aspiration (FNA), and they describe the
characteristics of the cell nuclei in the image. The dataset records 569 cases, as shown in Figure 3, in two
classes (62.74% benign and 37.26% malignant). Each record carries an ID, the diagnosis, and measurements
of ten characteristics of the cell nuclei: radius, texture, perimeter, area, smoothness, compactness,
concavity, concave points, symmetry, and fractal dimension.


Figure 3. Class distribution in the Wisconsin Breast Cancer Diagnostic data: 357 benign (B) cases and 212
malignant (M) cases.
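The class counts in Figure 3 can be checked with a few lines of Python. The sketch below assumes the copy
of the dataset bundled with scikit-learn, which encodes malignant as 0 and benign as 1 and stores 30
real-valued features per case; a CSV export of the same data may use different column names and labels.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# Assumption: scikit-learn's bundled copy of the Wisconsin Diagnostic data.
data = load_breast_cancer()

# Count how many cases fall into each class (index 0 = malignant, 1 = benign).
counts = np.bincount(data.target)

for name, count in zip(data.target_names, counts):
    share = 100 * count / counts.sum()
    print(f"{name}: {count} ({share:.2f}%)")
# malignant: 212 (37.26%)
# benign: 357 (62.74%)
```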


3.3. Experiment Environment 


The scikit-learn package and the Python programming language were used for all of the experiments with
the machine learning algorithms discussed in this research. Scikit-learn, sometimes referred to as sklearn, is
a Python-based open-source machine learning toolkit [17]. It is designed to operate with Python's NumPy
and SciPy numerical and scientific libraries, and it provides the Random Forest, Logistic Regression, and
AdaBoost classifiers. Extreme Gradient Boosting is provided by the separate XGBoost library, which offers
a scikit-learn-compatible interface.


4. RESULTS AND DISCUSSION 


We applied the machine learning methods to the Wisconsin Diagnostic breast cancer dataset and assessed
the models using accuracy and AUC score in order to choose the best algorithm for breast cancer
prediction. The confusion matrix, shown in Figure 4, is a way of assessing performance on a classification
task in which the result may be one of two or more classes. It is a table that cross-tabulates the predicted
and actual classes, giving the counts of true positives (TP), true negatives (TN), false positives (FP), and
false negatives (FN). Accuracy is the most common performance metric for classification algorithms. It is
defined as the ratio of correctly predicted instances to all predictions made. Precision is the fraction of
instances predicted as positive that are actually positive. The sensitivity (recall) of a machine learning
model is the fraction of actual positive instances that it correctly identifies. The F1 score is the harmonic
mean of precision and sensitivity. The accuracy percentages for the Wisconsin Breast Cancer Diagnostic
dataset are shown in Table 2. The accuracy of each classifier depends on the training and testing sets, but
logistic regression achieves a higher accuracy (96.491%) and AUC score (0.9947) than the other classifiers,
as shown in Figure 5.
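The metrics defined above follow directly from the four confusion-matrix counts. The sketch below is
illustrative only: the function name classification_metrics and the counts passed to it are hypothetical and
are not taken from the confusion matrices in Figure 4.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, sensitivity and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total       # correct predictions over all predictions
    precision = tp / (tp + fp)         # predicted positives that are truly positive
    sensitivity = tp / (tp + fn)       # actual positives that were found (recall)
    # Harmonic mean of precision and sensitivity.
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "f1": f1,
    }


# Hypothetical counts chosen for easy arithmetic, not results from this study.
metrics = classification_metrics(tp=40, fp=5, fn=10, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
# accuracy: 0.8500
# precision: 0.8889
# sensitivity: 0.8000
# f1: 0.8421
```

In this example the F1 score (0.8421) lies between precision and sensitivity and closer to the lower of the
two, which is the characteristic behaviour of a harmonic mean.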


Figure 4. Confusion matrices of the models: (a) XGBoost, (b) Random Forest, (c) AdaBoost, and
(d) Logistic Regression.


Table 2. Comparison of models on the basis of accuracy and AUC score


Sr No   Model Name                 Accuracy (%)   AUC Score
1       Logistic Regression        96.491228      0.994709
2       AdaBoost                   95.321637      0.992798
3       Random Forest Classifier   94.152047      0.985670
4       XGBoost                    91.812865      0.982216


Figure 5. Area under the curve (AUC) for all models: (e) XGBoost, (f) Random Forest, (g) AdaBoost, and
(h) Logistic Regression. The AUC score ranges from 0 to 1; a value close to 1 indicates better prediction,
and a value close to 0 indicates wrong prediction [18].
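One way to read the AUC score is as the probability that a randomly chosen positive case receives a higher
score than a randomly chosen negative case. The sketch below computes it by direct pairwise comparison;
the helper name pairwise_auc and the toy labels and scores are hypothetical and serve only to illustrate
the 0-to-1 range described in the caption.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]

# One of the four positive/negative pairs is ranked the wrong way round.
print(pairwise_auc(labels, [0.1, 0.4, 0.35, 0.8]))  # 0.75

# Every positive scores above every negative: perfect ranking.
print(pairwise_auc(labels, [0.1, 0.2, 0.8, 0.9]))   # 1.0

# Every positive scores below every negative: consistently wrong ranking.
print(pairwise_auc(labels, [0.9, 0.8, 0.2, 0.1]))   # 0.0
```

A score of 0.5 corresponds to a ranking no better than chance, which is why values near 1 are read as good
prediction and values near 0 as systematically inverted prediction.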


5. CONCLUSION 


We applied four main algorithms to the Wisconsin Breast Cancer Diagnostic dataset (WBCD), namely
Random Forest, Logistic Regression, Extreme Gradient Boosting (XGBoost), and the AdaBoost classifier,
in order to calculate, analyze, and evaluate the outcomes obtained in terms of accuracy and AUC score.
The goal was to find the most reliable and accurate algorithm. All algorithms were implemented in Python
using the scikit-learn package in the Anaconda environment. A detailed study of our models shows that,
with a 70:30 train-to-test split, logistic regression outperforms all other techniques, with an accuracy of
96.49% and an AUC score of 0.9947. Logistic regression has thus shown its effectiveness in the recognition
and prediction of breast cancer. It is important to keep in mind that all of the results relate solely to the
WBCD dataset. Therefore, in order to validate these findings, it is crucial to consider how the same
techniques and strategies may be applied to other databases in future projects. Our future work will also
make use of new variables and machine learning techniques on larger datasets.